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Abstract 

Mapping small reads to genome reference is an essential and more common approach to identify microRNAs (miRNAs) in 
an organism. Using closely related species genomes as proxy references can facilitate miRNA expression studies in non- 
model species that their genomes are not available. However, the level of error this introduces is mostly unknown, as this is 
the result of evolutionary distance between the proxy reference and the species of interest. To evaluate the accuracy of 
miRNA discovery pipelines in non-model organisms, small RNA library data from a mosquito, Aedes aegypti, were mapped to 
three well annotated insect genomes as proxy references using miRanalyzer with two strict and loose mapping criteria. In 
addition, another web-based miRNA discovery pipeline (DSAP) was used as a control for program performance. Using 
miRanalyzer, more than 80% reduction was observed in the number of mapped reads using strict criterion when proxy 
genome references were used; however, only 20% reduction was recorded for mapped reads to other species known 
mature miRNA datasets. Except a few changes in ranking, mapping criteria did not make any significant differences in the 
profile of the most abundant miRNAs in A. aegypti when its original or a proxy genome was used as reference. However, 
more variation was observed in miRNA ranking profile when DSAP was used as analysing tool. Overall, the results also 
suggested that using a proxy reference did not change the most abundant miRNAs' differential expression profiles when 
infected or non-infected libraries were compared. However, usage of a proxy reference could provide about 67% of the 
original outcome from more extremely up- or down-regulated miRNA profiles. Although using closely related species 
genome incurred some losses in the number of miRNAs, the most abundant miRNAs along with their differential expression 
profile would be acceptable based on the sensitivity level of each project. 

Citation: Etebari K, Asgari S (2014) Accuracy of MicroRNA Discovery Pipelines in Non-Model Organisms Using Closely Related Species Genomes. PLoS ONE 9(1): 
e84747. doi:10.1371/journal.pone.0084747 

Editor: Olle Terenius, Swedish University of Agricultural Sciences, Sweden 

Received August 21, 2013; Accepted November 19, 2013; Published January 3, 2014 

Copyright: © 2014 Etebari, Asgari. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits 
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 

Funding: This work was supported by an Australian Research Council (www.arc.gov.au) Discovery grant to S.A. (DP1 101021 12) and a University of Queensland 
(www.uq.edu.au) PhD scholarship to K.E. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the 
manuscript. 

Competing Interests: The authors have declared that no competing interests exist. 
* E-mail: s.asgari@uq.edu.au 



Introduction 

microRNAs (miRNAs) are small non-coding RNAs of ~22 
nucleotides, which are highly conserved among evolutionarily 
related species, and many even have homologs in distantly related 
species [1]. They have added a new facet of control to the complex 
network of gene transcription pathways and regulate around 30- 
75% of different mRNA transcripts in eukaryote cells [2]. miRNAs 
regulate the expression of target genes by binding to complemen- 
tary sequences in the target mRNA and play important roles in 
various biological processes through post-transcriptional regula- 
tion of gene expression. Differential expression of miRNAs under 
various biological conditions, such as development, immune 
challenge, host-microorganism interactions and stresses has been 
reported in many species [3-1 1]. These characteristics have made 
some miRNAs suitable biomarkers for disease diagnostics [12,13]. 

In recent years, the number of miRNA annotations has 
increased particularly in large and/ or poorly annotated genomes 
[14] due to the large amount of sequencing data that can readily 
be produced by next generation sequencing platforms, such as the 
Illumina and Solexa. Since identifying the first miRNA in 
Caenorhabditis elegans [15], massive numbers of miRNAs have been 
identified in other model organisms. In the last few years, by 



increasing our knowledge of miRNA biology and also significant 
reductions in sequencing costs, the number of research projects 
with a focus on the role of miRNA under different biological 
conditions in non-model organisms has also increased. Due to 
improvements in prediction algorithms, miRNA discovery from 
various non-model organisms has advanced, with 21,264 miRNAs 
known to date (miRBase vl9.0). There are a few technical factors 
such as sequencing accuracy, genomic mapping efficacy, and small 
RNA library preparations, which make small RNA-seq (smRNA- 
seq) data interpretation a daunting task [14,16]. Using this 
technology in a species lacking genomic resources is quite 
challenging due to high levels of small RNA diversity and 
concerns over read mapping accuracy in the absence of a genome 
scaffold. 

In many studies, detection of known/ conserved miRNAs and 
their expression levels is a priority rather than discovery of novel 
miRNAs. In this case, mapping millions of sequencing reads to a 
reference genome can be replaced by aligning these small reads 
against the sequences of known miRNAs in other species. Many 
miRNA analysis tools, such as miRExpress [17] and DSAP [18], 
have been developed based on this approach which could be used 
to determine miRNA expression profiles when genomic sequences 
are unavailable. Using other organisms' genomes as proxy 
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Mature miRNA in miRbase 



Figure 1. Number of reads from A. aegypti smRNA-seq data mapped to four selected species unique mature miRNA sequences in 
miRBase. 

doi:1 0.1 371 /journal.pone.0084747.g001 



reference to identify conserved miRNAs in non-model species has 
been approached by many research groups [19-22]. How well do 
miRNA discovery pipelines perform for miRNA discovery analysis 
from smRNA-Seq data in the absence of a sequenced genome? 

In this study, two mosquito small RNA library data from Aedes 
aegypti were aligned to two closely related (Anopheles gambiae and 
Drosophila meh.noga.ster) and one distantly related (Bombyx mori) insect 
genomes as proxy references to evaluate the accuracy of 
identification of known and novel miRNAs and their differential 



expression if the original genome sequence was not available. The 
outcomes of these analyses were validated by comparing the results 
when the original genome sequence was used as the standard 
reference genome. miRanalyzer and DSAP were selected for this 
study as they are user-friendly web servers with a short 
computational time and their overall approach towards miRNA 
detection has made them popular. The information in regards to 
the accuracy of miRNA discovery pipelines using closely related 
organisms' genomes could provide valuable knowledge for 




■A aegypti MA. gambiae D. melanogaster ■ B. mori 

Figure 2. Number of reads from A. aegypti smRNA-seq data mapped to four selected species sequence of pre-miRNA hairpins in 
miRBase. 

doi:1 0.1 371 /journal.pone.0084747.g002 
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Figure 3. Number of conserved or known miRNAs identified in A. aegypti smRNA-seq data by miRanalyzer and DSAP. A) The total 
number of identified miRNAs with strict and loose criteria by mapping to species unique sequences in miRbase. B) Number of overlap or common 
miRNAs between A. aegypti and other selected species. 
doi:1 0.1 371 /journal. pone.0084747.g003 



scientists interested in the distribution patterns of miRNAs or 
discovering new miRNAs in non-model species. 

Results and Discussion 

Mapping Small Reads to Databases and Prediction of 
Conserved miRNAs 

To identify known miRNAs in the libraries, we used the 
miRBase repository [23] which offers mature and precursor 
miRNA (pre-miRNA) sequences. As expected, more reads 
mapped to A. aegypti's original mature miRNA set in comparison 
with those of other species in both analysis pipelines. More than 
20% reduction was observed in the number of mapped reads using 
the strict criterion when other species miRNA sets were used as 
alignment proxy references (Fig. 1). The difference between the 
number of mapped reads when both mapping criteria were 
applied for the two closely related species, A. aegypti and A. gambiae, 
was around 20%, while applying the loose criterion (max 2 
possible mismatches) led to significant enhancement in the number 
of reads mapped to known mature miRNAs in D. melanogaster and 
B. mori datasets (Fig. 1). In this case, allowance of two mismatches 
increased the coverage of data processing and showed the opposite 
pattern when the genome sequences were used as proxy alignment 
references. 

The number of reads that were mapped to known mature 
miRNAs by DSAP was significantly less compared to miRanalyzer 
and this is due to inflexible aligning algorithm of DSAP (Fig. 1). 
The only reads that showed 100% identities were retained in 
DSAP, while miRanalyzer is able to analyse the isomers and 
potential variations in miRNA sequence. A software performance 
study showed DSAP and miRanalyzer keep the highest percentage 
of reads in the process of mapping among other miRNA discovery 
pipelines [24]. However, in the current analysis, the number of 
used reads demonstrated significant differences between the two 
pipelines. This indicates miRanalyzer utilized a larger portion of 
the available data for further analysis. miRanalyzer was developed 
as a sensitive learning algorithm to predict conserved and novel 
miRNAs with an AUC (Area Under Curve) of 97.9% and recall 
values of up to 75% on unseen data [25]. 

Figure 2 shows the number of reads mapped to known pre- 
miRNA sequences in miRBase, which in contrast to mature 
miRNA, using both mapping criteria, more reads were mapped to 
B. mori reference genome. Overall, the number of mapped reads 
when the strict criterion was applied, which allows just one 
mismatch in the reference dataset, was significandy lower than 
that of the loose criterion. Indeed, 83 and 13 times enrichment was 



observed in the number of mappable reads with the loose criterion 
in D. melanogaster and A. aegypti, respectively (Fig. 2). The pre- 
miRNA is more diverged than mature miRNA among different 
species and thus more mapped reads were expected with the loose 
criterion, which allows for more mismatches in the proxy reference 
genomes. Therefore, using loose criterion increased the coverage 
when a proxy reference was used and the results suggested that 
potentially there could be more homologous miRNAs in A. aegypti 
that have not been reported and remain to be discovered. 

The two types of pipelines used in this study had the highest 
mature miRNA prediction success in software performance tests of 
Homo .sapiens and C. elegans small RNA sequencing data [26] . In the 
current study, the number of predicted mature and pre-miRNAs 
were compared with species standard, which was defined as the 
total number of miRNAs for each species found in miRBase. 
When the reads were aligned to A. aegypti % reported miRNAs using 
miRanalyzer, sequences of 84 known mature miRNAs were 
identified with strict criterion, which is almost 67.7% of the 
reported miRNAs from this species, while choosing the loose 
criterion increased this value to 93.5%, which is 116 miRNAs 
(Fig. 3A; Table 1). 

In all cases, DSAP values were lower than those of miRanalyzer 
when strict criterion was applied. For example, only 5.1% of B. 
mori miRNA were identified by aligning A. aegypti small RNA reads 
to B. mori known miRNAs (Table 1). Using DSAP or strict 
criterion and other insect species data as proxy reference 
significandy reduced the number of detected miRNAs for A. 
aegypti small RNA library (Fig. 3A). Notably, using loose criterion 
dramatically increased the prediction values with B. mori and D. 
melanogaster proxy miRNA references (Fig. 3A). The results suggest 
that probably there are many insect conserved miRNAs which 
have not been reported from A. aegypti and their homologues could 
easily be identified in two genetically well annotated species B. mori 
and D. melanogaster. miRanalyzer detected 65.7% of A. gambiae 
mature miRNA sequences when A. aegypti small RNA reads were 
aligned to known A. gambiae miRNAs in miRBase, a value very 
close to its original species discovery rate (67.7% in A. aegypti). The 
data in Table 1 also show that the strict criterion provided around 
10% of known miRNAs in phylogenetically distant species D. 
melanogaster and B. mori, suggesting high levels of diversification in 
miRNAs evolution. The majority of those miRNAs (around 84%) 
were identified with loose criterion in the two species, which shows 
that there are more than three mismatches or differences between 
A. aegypti mature sequences with those of two phylogenetically 
distant species. High level of conservation is expected between D. 
melanogaster and the two other mosquitoes {Aedes and Anopheles); 



Table 1. Identification rate of known mature and precursor miRNAs from A. aegypti in four selected species. 





Species 


loose criterion 




strict criterion 




DSAP (Mature) 




Mature 


pre-miRNA 


Mature 


pre-miRNA 




Aedes aegypti 


93.5% 


84.2% 


67.7% 


30.7% 


62.1% 


Anopheles gambiae 


93.8% 


94.0% 


65.7% 


44.8% 


58.5% 


Drosophila melanogaster 


83.8% 


99.6% 


13.6% 


1 .3% 


8.4% 


Bombyx mori 


84.0% 


97.3% 


10.5% 


1.4% 


5.1% 



doi:1 0.1 371 /journal.pone.0084747.t001 
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Figure 4. Percentage of unique reads from A. aegypti smRNA-seq data aligned to genome sequences from four different insects 
using strict or loose criterion. 

doi:1 0.1 371 /journal.pone.0084747.g004 



however, previous comparative genomic analysis suggested that 
several fruit fly developmental genes could not be identified in 
mosquito genomes [27]. As a consequence, it is expected that a 
number of Drosophila miRNAs may not have orthologues in 
mosquitoes. 

Research indicates that evolution of miRNAs is an ongoing 
process and the continuing innovation of novel miRNA families in 
different organisms is not the only way of evolution in this group of 
small RNAs but also the diversification of established families 
producing additional paralogues of miRNAs [28] . The pair-wise 



sequence identity of paralogous pre-miRNA sequences are often 
below 50-60% while their mature miRNA sequences show high 
level of conservation [29]. Previous studies have shown that the 
terminal loop is the least conserved part of pre-miRNAs [21,30]. 
As shown in Table 1, with strict criterion, only less than 1.5% of 
pre-miRNAs were identified when D. melanogaster and B. mori 
genomes were used as proxy references. 

The number of conserved miRNAs between A. aegypti and other 
selected species are presented in Fig. 3B. The results revealed that 
using the strict criterion 41, 44 and 39 miRNAs were common 





450 






01 




4-* 


400 


;o 




T3 




c 
ro 


350 


(J 




< 
Z 


300 


DC 




E 


250 






> 
o 


200 


z 




o 

1_ 


150 


Ol 




Si 

£ 


100 






z 


50 




0 



! No Dicer Pattern 
Low Fluctuation 

I Diffuse Dicer Pattern 
Perfect Dicer Pattern 



A. aegypti 



A. gambiae D. melanogaster 



B. mori 



Genome Reference 

Figure 5. Number of novel miRNA candidates predicted in A. aegypti smRNA-seq data by mapping to four selected species 
genomes. "Perfect Dicer pattern: A perfect 3' 2 nt overhang exists for the most expressed read and not more than 3 read clusters do exist on the pre- 
miRNA (one for the mature, one for the mature* and one for the loop sequences). Diffuse Dicer patter: A3' 1-4 nt overhang exists for the most 
expressed read and not more than 3 read clusters do exist on the pre-miRNA. Low fluctuation: No Dicer pattern has been detected but only one read 
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Table 2. The top 20 most abundant A. aegypti miRNAs identified by miRanalyzer, when reads aligned to four species database. 



A. aegypti 




A. gambiae 




D. melanogaster 




B. mori 




strict 


loose 


strict 


loose 


strict 


loose 


strict 


loose 


miR-184 


miR-184 


miR-184 


miR-184 


miR-184-3p 


miR-184-3p 


miR-184-3p 


miR-184-3p 


miR-275-3p 


miR-275-3p 


miR-275 


miR-275 


miR-275-3p 


miR-275-3p 


miR-275-3p 


miR-275-3p 


miR-317 


miR-317 


miR-317 


miR-317 


miR-317-3p 


miR-317-3p 


miR-317-3p 


miR-317-3p 


miR-2940-3p + 


miR-2940-3p + 














miR-276-3p 


miR-276-3p 


miR-276-3p 


miR-276-3p 


miR-276a-3p 


miR-276a-3p 


miR-276-3p 


miR-276-3p 


miR-92b-3p 


miR-92b-3p 


miR-92b 


miR-92b 


miR-92b-3p 


miR-92b-3p 




miR-92b 


miR-2940-5p + 


miR-2940-5p + 














miR-281-5p 


miR-283 


miR-283 


miR-283 


miR-281-2-5p 


miR-281-2-5p 


miR-281-5p 


miR-281-5p 


miR-283 


miR-281-5p 


miR-2 


miR-2 


miR-283-5p 


miR-283-5p 


miR-283-5p 


miR-283-5p 


miR-989 


miR-989 


miR-989 


miR-989 


miR-989-3p 


miR-989-3p 


miR-2a-3p 


miR-2a-3p 


miR-305-5p 


miR-305-5p 


miR-305 


miR-305 


miR-305-5p 


miR-305-5p 


miR-989a 


miR-989a 


miR-34-5p 


miR-34-5p 


miR-34 


miR-34 


miR-2a-3p 


miR-2a-3p 


miR-305-5p 


miR-305-5p 


miR-2a-3p 


miR-2a-3p 






miR-34-5p 


miR-34-5p 


miR-34-5p 


miR-34-5p 


miR-998 


miR-998 






miR-998-3p 


miR-998-3p 


miR-998 


miR-998 


miR-14 


miR-14 


miR-14 


miR-14 


miR-14-3p 


miR-14-3p 


miR-14-3p 


miR-14-3p 


miR-92a-3p 


miR-92a-3p 


miR-92a 


miR-92a 


miR-92a-3p 


miR-92a-3p 




miR-92a 


miR-12-5p 


miR-12-5p 


miR-12 


miR-12 


miR-12-5p 


miR-12-5p 


miR-12 


miR-12 


miR-11-3p 


miR-11-3p 


miR-11 


miR-11 


miR-11 -3p 


miR-11-3p 


miR-11 -3p 


miR-11 -3p 


miR-1889-5p + 


miR-1889-5p + 








miR-SIO-Sp** 


miR-2779 ++ 


miR-2779 4 ^ 


miR-306-5p 


miR-263a-5p 


miR-263a 


miR-263a 


miR-263a-5p 


miR-263a-5p 


miR-263a-5p 


miR-306a-5p 






miR-988^ 


miR-306 


miR-306-5p 


miR-306-5p 


miR-71-Sp** 


miR-263a-5p 






miR-970^ 


bantam 4 ^ 




miR-304-5p ++ 


bantam-Sp 4 ^ 


miR-71-Sp 4 ^ 






bantam 4 ^ 


miR-988* 4 


bantam-Sp 4 ^ 




miR-970-3p ++ 








miR-87 4 ^ 








miR-252-5p ++ 








miR-8 4 ^ 


miR-87 4 ^ 











+ A. aegypti species-specific miRNAs. 

^he miRNAs that appear at top 20 list when proxy references were used. 
doi:1 0.1 371 /journal.pone.0084747.t002 



between A. aegypti and A. gambiae, D. melanogaster and B. mori, 
respectively. This overlap in detected miRNAs between species in 
each analysis suggests that several undescribed miRNAs poten- 
tially remain to be discovered in A. aegypti. The first A. aegypti 
miRNA repository was reported in 2009 by mapping 545 
pyrosequencing data to the mosquito's genome using the BLAST 
algorithm, which probably led to missing many potential miRNAs 
due to the allocated mapping criteria [31]. 

The results from this study also suggests high levels of 
conservation or similarity in miRNA repertoires among phyloge- 
netically close species; for example 56 out of 61 A. gambiae miRNAs 
(~92%) were identified with A. aegypti small RNA library while this 
value reduced to 13% (64 out of 474) in the phylogenetically 
distant species B. mori when loose criterion was applied (Fig. 3B). In 
addition, it has been reported that most mosquito miRNAs are 
conserved across divergent species with only 1 1 distinct miRNA 
genes to be mosquito-specific [31-33]. However, we only recalled 
around 50% of A. aegypti miRNAs when another mosquito's 
genome (A. gambiae) or known miRNA dataset was used as proxy 
reference (Fig. 3B). 



Mapping Small RNAs to Genome References and 
Prediction of Novel miRNAs 

As expected, the number of reads from the A. aegypti small RNA 
libraries that mapped to the A. aegypti genome sequence were 
significantly higher than matched reads to other proxy genome 
references. When the loose criterion was applied, the percentage of 
unique reads matched to the genome increased (Fig. 4). A laxer 
mapping to miRNAs or other libraries such as Rapbase, Rfam will 
remove more reads prior to the mapping to the genome. 
Therefore, less reads are mapped to the genome as they are 
removed at earlier stages. However, when other organism's 
genomes were selected as proxy references, the number of non- 
matched reads considerably increased once the strict criterion was 
applied. The sensitivity of the Burrows-Wheeler Transform 
(BWT)-based algorithms such as Bowtie, which is used in 
miRanalyzer, decreases exponentially with the number of 
mismatches in genome reference. Improvements of mapping 
criteria are essential for the analysis of small RNA reads when 
proxy reference genomes are utilized for mapping purposes 
[34,35]. DSAP is not able to perform this task because it is only 
designed for the identification of known miRNAs, which is 
independent of a complete genome sequence. 

As mentioned above, identification of more potential novel 
miRNAs in A. aegypti was expected since in the previous step it was 
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Table 3. The top 20 most abundant A. aegypti miRNAs 
identified by DSAP, when reads aligned to four species 
database. 



A. aegypti 


A. gambiae 


D. melanogaster 


B. mori 


miR-184 


miR-184 


miR-184 


miR-275 


miR-275 


miR-275 


miR-275 


miR-276-3p 


miR-317 


miR-276-3p 


miR-276a 


miR-281-5p 


miR-2940-3p 


miR-92b 


miR-281-2-5p 


miR-305 


miR-276 


miR-989 


miR-305 


miR-14 


miR-92b 


miR-305 


miR-998 


miR-34 


miR-2940-5p 


miR-34 


miR-14 


miR-263a 


miR-989 


miR-14 


miR-12 


miR-252 


miR-305 


miR-12 


miR-11 


miR-2a 


miR-11-3p 


miR-11 


miR-970 


miR-8* 


miR-14 


miR-92a 


miR-988 


miR-190 


miR-12 


miR-970 


miR-252 


miR-184 


miR-34 


miR-988 


miR-34 


miR-277 


miR-11 


miR-2 


bantam 


miR-7 


miR-92a 


miR-13b 


miR-2a 


miR-100 


miR-1889-5p 


miR-278 


miR-13b 


let-7 


miR-71 


miR-279 


miR-278 


miR-10 


miR-263a 


miR-190 


miR-279 


miR-1000 


miR-970 


miR-277 


miR-9b 


miR-11 


bantam-3p 


miR-7 


miR-190 


miR-279c 


doi:1 0.1 371 /journal.pone.0084747.t003 



found that a high number of reads mapped to well-annotated 
miRNAs from D. melanogaster and B. mori. Accordingly, miRana- 
lyzer detected more novel miRNAs in A. aegypti when the 
mosquito's genome was used as reference, compared with when 
other organisms' genomes were used as proxy references. These 
miRNAs were classified into four groups (Perfect Dicer pattern, 
Diffuse Dicer, low fluctuation and no Dicer patterns) based on the 
secondary structure of pre-miRNAs and the read alignments 
(Fig. 5). In all the four selected genomes, most of the novel 
miRNAs were categorized in low fluctuation and no Dicer 
patterns groups. Indeed, identification of novel miRNAs needs 
experimental validation but this in silico prediction confirmed that 
using the original species genome, the chance to identify new 
miRNAs is increased. Detection of fewer novel miRNAs, when 
small RNAs were mapped to D. melanogaster and B. mori genomes, is 
probably due to identification of many miRNAs that are already 
known in these species. In other words, some portions of miRNAs 
which were considered as novel (not reported) miRNAs in A. aegypti 
in this analysis have already been known or reported as 
homologues in other species. 

Most of the novel miRNA candidates, which were identified by 
miRanalyzer, had very low copy numbers. Comparison of 
miRanalyzer with other similar web-based tools suggested that 
miRanalyzer is better suited to detect low-expressed novel miRNA 
candidates, and novel candidates represented by low abundant 
reads may not be excluded from library using its algorithm [24] . 

Identification of Abundant miRNAs 

The top twenty most abundant miRNAs in A. aegypti small RNA 
libraries were identified by miRanalyzer (Table 2) and DSAP 
(Table 3) through mapping reads to the four selected species 



references. Using loose or strict criteria did not make any 
significant differences in the profile of the most abundant miRNAs 
in A. aegypti compared with when its original genome was used as 
reference. miR-184, miR-275-3p, miR-317 were the most 
abundant miRNAs in all the different analyses. However, the 
important A. aegypti'% species-specific miRNA miR-2940 was 
missed when other species data were used as proxy references 
(Table 2 and 3). 

A. aegypti mature miRNA database as a reference showed high 
level of similarities between DSAP and miRanlayzer in the top 10 
highly expressed miRNAs ranking. However, using other organ- 
isms' known miRNA as reference made more variation in the list 
when DSAP was compared with miRanalyzer. The ranking profile 
produced by miRanalyzer is more reliable because of its flexible 
mapping algorithm and higher data coverage. For example, when 
B. mori was selected as a reference, miR- 184, which is a very highly 
expressed miRNA in most cases, was not allocated in the top 1 0 by 
DSAP (Table 3). 

In general, a significant proportion of miRNAs lack homologues 
among other species, which is likely due to species-specific 
adaptations. They are potentially the most interesting aspects of 
a species miRNA evolution, but they could simply be missed 
during annotation using inappropriate discovery pipelines and 
reference genomes. For example, A. aegypti species-specific miRNA 
aae-miR-2940 is one of the highly expressed miRNAs, which was 
only identified when A. aegypti % genome was used as reference. The 
miRNA plays important roles in the maintenance of the 
endosymbiont Wolbachia in the mosquito and is involved in 
inhibition of replication of dengue virus in Wolbachia-'mfected A. 
aegypti [36,37]. 

Differentially Expressed miRNAs 

In many instances, highly differentially expressed miRNAs 
under different treatments are of interest to researchers to find 
their biological functions in non-model species. To examine the 
impact of the species used as a genome reference to determine 
differentially expressed miRNA profiles, we used small RNA 
libraries from A. aegypti Wolbachia-mStcted and non-infected Aag2 
cells for comparison [38]. miRNAs with an average read number 
of less than 10 were discarded from DESeq analysis output file and 
then the top 40 extremely up- and down-regulated miRNAs were 
selected for comparison (Fig. 6). When the strict criterion was used, 
around 27 out of 40 miRNAs were common between analyses 
when the genomes of A. aegypti and other species were used as 
references. In other words, using other organisms' genomes as 
proxy references could provide about 67% of the outcome when 
the original (A. aegypti) genome was used as reference. However, 
this value was significantly reduced in phylogenetically distant 
species B. mori (17.7%) when loose criterion was applied (Fig. 6). 

Differential expression values of 10 most abundant A. aegypti 
miRNAs in different analyses are presented in Table 4. Although 
in some cases the values are different in each analysis, the overall 
patterns are very similar. The results suggested that using other 
organisms' genomes as reference did not change the most 
abundant miRNA differential expression profile which was 
calculated based on log 2 fold change. Although DSAP provides 
visualization interfaces for differential mature miRNA expression 
level, reporting non-normalized data is the main limitation of 
DSAP for this task. Due to this limitation, the value for log 2 fold 
changes in each analysis might vary with its corresponding value in 
miRanalyzer. 
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Strict Criterion 



A aegypti a. gambiae A. aegypti D. melanogaster A. aegypti B. mori 



13 27 13 14 26 14 13 27 13 



Loose Criterion 



A aegypti a. gambiae A. aegypti D. melanogaster A aegypti B. mori 

/ \ x \ / \ / ▲ 

15 25 15 25 15 25 33 7 33 



Figure 6. Unique highly differentially expressed miRNAs in Aag2 cell with and without Wolbachia infection when genomes of 
different selected insect species were used as references. Overlap areas show the number of common miRNAs in each comparison with high 
level of fold changes (more than 2). 
doi:1 0.1 371 /journal.pone.0084747.g006 



Conclusions 

Analyzing small RNA-Seq data for miRNA discovery has 
classically required genomic sequences from the species of interest 
in order to map the reads to the reference genome. In the absence 
of genome sequences in many non-model organisms, a number of 
tools have been developed based on other species' conserved 



miRNA datasets or other closely related species genome sequences 
as proxy references; however, the accuracy of these approaches 
have not been thoroughly evaluated; although there have been 
studies analysing the accuracy of quantitative RNA-Seq for gene 
expression in non-model species including de novo transcriptome 
assembly and the use of non-target species as reference scaffolds 
[39]. These studies have illustrated that using closely related 



Table 4. Differential expression value of 10 most abundant A. aegypti miRNAs in different analyses when other insects data were 
used as reference. 



miRNA 


A. aegypti 






A. gambiae 




D. melanogaster 




B. mori 








Loose 


Strict 


DSAP 


Loose 


Strict 


DSAP 


Loose 


Strict 


DSAP 


Loose 


Strict 


DSAP 


miR-184 


-0.5547 


-0.5328 


-0.42 


-0.5355 


-0.5343 


-0.25 


-0.3218 


-0.5075 


0.42 


-0.3062 


-0.5349 


-1.77 


miR-275-3p 


-0.9623 


-0.9398 


-0.15 


-0.9430 


-0.9411 


-0.75 


-0.7293 


-0.9143 


0.75 


-0.7131 


-0.941 1 


-0.75 


miR-317 


- 1 .0387 


-1.0176 


-0.71 


-1.0261 


-1.0275 


+00 


-0.7996 


-0.9923 




-0.7897 


-1.0191 




miR-2940-3p 


0.2670 


0.2888 


0.56 




















miR-2940-5p 


0.5286 


0.5509 


0.78 




















miR-276-3p 


-0.4760 


-0.4537 


-0.11 


-0.4568 


-0.4551 


-0.12 


-0.2437 


-0.4287 


0.12 


-0.2269 


-0.4551 


-0.12 


miR-92b-3p 


-0.7702 


-0.7470 


— 00 


-0.7510 


-0.7484 


-0.41 


-0.5393 


-0.7172 




-0.5189 






miR-283 


0.3487 


0.3705 


1.31 


0.3712 


0.3872 




0.5826 


0.4100 




0.601 1 


0.3872 




miR-281-5p 


- 1 .2454 


-1.2243 




0.4644 


0.6258 




-1.0115 


-1.1960 


0.90 


-0.9939 


-1.2214 


-0.90 


miR-305-5p 


-0.0195 


0.0014 


0.19 


0.0000 


0.0000 


0.25 


0.2137 


0.0268 


-0.25 


0.2299 


0.0000 


0.25 


Differential expression 


was calculated based on 


log 2 fold 


change. 



















doi:1 0.1 371 /joumal.pone.0084747.t004 
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species genome as reference incurs some losses in the number of 
predicted sequences and their expression data; in particular, the 
evolutionary distance between species introduces more biases and 
errors [39,40]. 

Our findings indicated that the most abundant and conserved 
miRNAs can successfully be identified from a non-model species 
smRNA-Seq data by using closely related species genome 
references, but using a proxy genome reference does not lead to 
the identification of the whole miRNA profile in species without 
complete sequenced genome. In addition, this approach provides a 
robust starting place for the identification of differentially 
expressed miRNAs which are often of great interest to researchers 
when comparing samples from cells or organisms under different 
treatments (e.g. infected and non-infected). The overall pattern of 
differential expression for miRNAs with high copy numbers did 
not show any significant changes when other spehcies genomes 
were used as proxy references. Accordingly, around 67% of 
extremely up- or down-regulated miRNAs were identified by using 
strict criterion and other species genomes as proxy references. 
However, similar to transcriptome data studies, when the genome 
of phylogenetically distant species was used as reference, the 
number of identified differentially expressed miRNAs was reduced 
to around 13% when loose criterion was applied. 

Methods 

Dataset Preparation 

Two small RNA libraries were generated from two Aedes aegypti 
(Diptera; Culicidae) Aag2 cell line samples using the Illumina 
Truseq™ Small RNA Preparation kit at LC Sciences Company 
(Houston, USA). The purified cDNA libraries were sequenced on 
Illumina GAIIx and raw sequencing reads (36 nts) were obtained 
using Illumina's Sequencing Control Studio software version 2.8 
followed by real-time sequencing image analysis and base-calling 
by Illumina's Real-Time Analysis version 1.8.70 (LC Sciences, 
Houston, USA). Two datasets with 1,409,306 and 3,347,907 raw 
reads were obtained from deep sequencing and a tab separated file 
with the read sequences and its counts were used as input file for 
miRanalyzer [41] and DSAP [18]. 

miRNA Analysis Workflow 

All reads with 'N' in their sequences and also those shorter than 
17 bases or longer than 26 bases were removed from our datasets. 
To detect the number of known miRNAs, the filtered reads were 
aligned to the corresponding species miRNA sequences in 
miRBase and also they were mapped to the proxy genome 
references for predicting novel miRNAs. 

An updated version of miRanalyzer, a web based server for the 
detection of known and prediction of novel miRNAs was used as 
the main pipeline for this analysis. This software is based on a 
random forest classifier and implements a highly accurate machine 
learning algorithm (Support Vector Machine) to predict new 
miRNA candidates from high throughput sequencing data [25]. 
The ultrafast short read aligner Bowtie was used to align the reads 
to the genomes and miRNA database (miRBase v. 19). DSAP, a 
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